One of the easiest ways to handle categorical variables (CVs) is to convert them into numbers and then process them.
In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
%matplotlib inline
In [2]:
data = pd.read_csv('../files/banking_logistic.csv', header=0)
# remove all the rows with missing data
data = data.dropna()
print(data.shape)
print(list(data.columns))
In [3]:
sns.countplot(y='marital', data=data)
Out[3]:
In [57]:
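from sklearn.preprocessing import LabelEncoder
# Map each marital category to an integer code (codes follow the sorted label order)
num = LabelEncoder()
data['marital'] = num.fit_transform(data['marital'].astype('str'))
# Plot the counts again; the axis now shows integer codes instead of labels
sns.countplot(y='marital', data=data)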
Out[57]:
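After label encoding, the plot shows integer codes rather than category names, so it helps to know which code stands for which label. The sketch below uses a small toy column (the values are illustrative and not taken from the banking data set) to show how the mapping can be read from a fitted `LabelEncoder` and how the codes can be turned back into labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for data['marital'] (illustrative values only)
marital = pd.Series(['married', 'single', 'divorced', 'single', 'married'])

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

# classes_ holds the original labels in sorted order; position i is the label for code i
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the integer codes back into the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Note that the codes are assigned in sorted label order, so they imply an ordering that the categories may not actually have.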
If bins of a continuous variable (such as age group, class group, percentage group, etc.) are available in the data set, they can be converted to numbers by
Combine levels: To avoid redundant levels in a categorical variable and to deal with rare levels, we can simply combine different levels. There are various methods of combining levels. Here are the commonly used ones:
Using frequency or response rate: Combining levels based on business logic is effective, but we may not always have the domain knowledge. Imagine you are given a data set from the Aerospace Department, US Govt. How would you apply business logic here? In such cases, we combine levels by considering the frequency distribution or the response rate.
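As a minimal sketch of the frequency-based approach, the snippet below merges every level whose share of the rows falls under a threshold into a single bucket. The helper name `combine_rare_levels`, the `min_share` threshold, the `'other'` label and the toy `jobs` column are all assumptions made for illustration; they do not come from the data set used in this notebook.

```python
import pandas as pd

def combine_rare_levels(series, min_share=0.05, other_label='other'):
    """Replace levels whose relative frequency is below min_share with other_label."""
    # Share of rows taken by each level
    shares = series.value_counts(normalize=True)
    # Levels that occur too rarely to keep on their own
    rare = shares[shares < min_share].index
    # Keep frequent levels as they are; overwrite rare ones with the catch-all label
    return series.where(~series.isin(rare), other_label)

# Toy column: two frequent levels and two rare ones (hypothetical values)
jobs = pd.Series(['admin'] * 10 + ['technician'] * 8 + ['student'] + ['unknown'])
combined = combine_rare_levels(jobs, min_share=0.10)
print(combined.value_counts())
```

The same idea could be applied to response rate by grouping levels whose mean target value is similar, although a sensible threshold depends on the data.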
Dummy coding is a commonly used method for converting a categorical input variable into numeric variables. A ‘dummy’, as the name suggests, is a duplicate variable that represents one level of a categorical variable. The presence of a level is represented by 1 and its absence by 0. One dummy variable is created for every level present. The representation below shows how a categorical variable is converted using dummy variables.
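One way to produce dummy variables in pandas is `pd.get_dummies`. The sketch below applies it to a small toy frame; the column values are illustrative and are not rows from the banking data set.

```python
import pandas as pd

# Toy frame with a single categorical column (illustrative values only)
df = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'single']})

# One 0/1 column per level; the prefix keeps the source column name visible
dummies = pd.get_dummies(df['marital'], prefix='marital', dtype=int)
print(dummies)

# Each row has exactly one dummy set to 1
print(dummies.sum(axis=1).tolist())
```

For models such as logistic regression, passing `drop_first=True` drops one level to avoid perfectly collinear columns. Whether that is appropriate here depends on the model being fitted.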
In [54]:
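# Draw the two outcome groups on separate axes so the plots do not overlap
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
d_yes = data[data['y'] == 1]
sns.countplot(y='marital', data=d_yes, ax=axes[0]).set_title('y = 1')
d_no = data[data['y'] == 0]
sns.countplot(y='marital', data=d_no, ax=axes[1]).set_title('y = 0')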
Out[54]: